This tutorial explores implementing the LLM Arena-as-a-Judge approach to evaluate large language model outputs using head-to-head comparisons. It demonstrates using OpenAI’s GPT-4.1 and Gemini 2.5 Pro, judged by GPT-5, in a customer support scenario.
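A minimal sketch of the head-to-head pattern the tutorial describes, assuming the OpenAI Python SDK; the judge prompt and the `judge_pair` helper are illustrative, not the tutorial's exact code.

```python
# Minimal sketch of pairwise "arena" judging with the OpenAI Python SDK.
# The prompt wording and helper function are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Given a customer-support question
and two candidate answers (A and B), reply with exactly one word:
"A" if answer A is better, "B" if answer B is better, or "TIE"."""

def judge_pair(question: str, answer_a: str, answer_b: str, judge_model: str = "gpt-5") -> str:
    """Ask the judge model which of two candidate answers better resolves the question."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": (
                f"Question:\n{question}\n\n"
                f"Answer A:\n{answer_a}\n\n"
                f"Answer B:\n{answer_b}"
            )},
        ],
    )
    return response.choices[0].message.content.strip()

# Example: verdict = judge_pair("How do I reset my password?", gpt41_reply, gemini_reply)
```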
A frozen-in-time version of the Paper Finder agent for reproducing evaluation results. The repo contains the code for the standalone Paper Finder agent, a paper-seeking agent intended to assist in locating sets of papers according to content-based and metadata criteria.
This GitHub repository directory contains resources for evaluating Large Language Models (LLMs), including a Jupyter Notebook demonstrating how to use LLM Arena as a judge, an equivalent Python script, and a README with instructions for viewing the notebook if it doesn't render correctly on GitHub.
MCP-Universe is a comprehensive benchmark designed to evaluate LLMs on realistic tasks that require interaction with real-world MCP servers, spanning 6 core domains and 231 tasks. It highlights the challenges of long-context reasoning, unfamiliar tool spaces, and cross-domain variations in LLM performance.
This article discusses methods to measure and improve the accuracy of Large Language Model (LLM) applications, focusing on building an SQL Agent where precision is crucial. It covers setting up the environment, creating a prototype, evaluating accuracy, and using techniques like self-reflection and retrieval-augmented generation (RAG) to enhance performance.
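A rough sketch of the self-reflection idea mentioned above, using `sqlite3` from the standard library; `generate_sql` is a hypothetical stand-in for whatever LLM call the article's agent uses.

```python
# Sketch of a self-reflection loop for an SQL agent: run the generated query,
# feed any database error back to the model, and retry.
# `generate_sql` is a hypothetical placeholder for the LLM call.
import sqlite3

def generate_sql(question: str, error_feedback: str | None = None) -> str:
    """Placeholder for an LLM call that turns a question (plus optional
    error feedback from a failed attempt) into a SQL query."""
    raise NotImplementedError

def answer_with_reflection(db_path: str, question: str, max_attempts: int = 3):
    conn = sqlite3.connect(db_path)
    feedback = None
    for _ in range(max_attempts):
        query = generate_sql(question, error_feedback=feedback)
        try:
            rows = conn.execute(query).fetchall()
            conn.close()
            return query, rows           # query executed cleanly
        except sqlite3.Error as exc:
            feedback = str(exc)          # let the model see what went wrong
    conn.close()
    raise RuntimeError(f"No valid query after {max_attempts} attempts: {feedback}")
```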
This article discusses Neural Magic's extensive evaluation of quantized large language models (LLMs), finding that quantized LLMs maintain accuracy competitive with their full-precision counterparts while improving inference efficiency.
- **Quantization Schemes**: Three different quantization schemes were tested: W8A8-INT, W8A8-FP, and W4A16-INT, each optimized for different hardware and deployment scenarios.
- **Accuracy Recovery**: The quantized models demonstrated high accuracy recovery, often reaching over 99%, across a range of benchmarks, including OpenLLM Leaderboard v1 and v2, Arena-Hard, and HumanEval (a simple recovery calculation is sketched after this list).
- **Text Similarity**: Text generated by quantized models was found to be highly similar to that generated by full-precision models, maintaining semantic and structural consistency.
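For reference, accuracy recovery is simply the quantized model's benchmark score expressed as a percentage of the full-precision baseline; a minimal illustration (the numbers below are made up, not from the article):

```python
# Accuracy recovery: quantized-model score as a percentage of the
# full-precision baseline. The example numbers are illustrative only.
def accuracy_recovery(quantized_score: float, baseline_score: float) -> float:
    return 100.0 * quantized_score / baseline_score

# e.g. a baseline of 80.0 and a quantized score of 79.4 -> ~99.25% recovery
print(f"{accuracy_recovery(79.4, 80.0):.2f}%")
```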
This article explores various metrics used to evaluate the performance of classification machine learning models, including precision, recall, F1-score, accuracy, and alert rate. It explains how these metrics are calculated and provides insights into their application in real-world scenarios, particularly in fraud detection.
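A small self-contained sketch of the metrics named above, computed from raw confusion-matrix counts; the fraud-style example counts are illustrative, not taken from the article.

```python
# Classification metrics from confusion-matrix counts (tp, fp, tn, fn).
# "Alert rate" here is the fraction of all cases flagged positive, as in
# fraud-detection settings; the example counts are illustrative.
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": (tp + tn) / total,
        "alert_rate": (tp + fp) / total,  # share of cases flagged as positive
    }

# e.g. 80 frauds caught, 20 false alarms, 890 clean approvals, 10 missed frauds
print(classification_metrics(tp=80, fp=20, tn=890, fn=10))
```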
This guide demonstrates how to execute end-to-end LLM workflows for developing and productionizing LLMs at scale. It covers data preprocessing, fine-tuning, evaluation, and serving.
This article discusses trends in Large Language Model (LLM) architecture, including the push toward more GPUs, more weights, and more tokens; energy-efficient implementations; the role of LLM routers; and the need for better evaluation metrics, faster fine-tuning, and self-tuning.
Langfuse is an open-source LLM engineering platform that offers tracing, prompt management, evaluation, datasets, metrics, and a playground for debugging and improving LLM applications. It is backed by several renowned companies and has won multiple awards. Langfuse is built with security in mind, with SOC 2 Type II and ISO 27001 certifications and GDPR compliance.